Our analysis found that the strongest indicators of Airbnb prices in San Francisco are the number of bedrooms, the property type, and the number of reviews. The very low p-values for these variables in our Final Model, which we shall see later, support this quantitatively. The result is consistent with what we would reasonably expect: more bedrooms and a more luxurious property type go with a higher price, while a greater number of reviews goes with a lower price.
You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:
At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
glimpse(listings)
Rows: 6,566
Columns: 74
$ id <dbl> 958, 5858, 7918, 8142, 83~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021101e+13, 2.021101e+1~
$ last_scraped <date> 2021-10-06, 2021-10-06, ~
$ name <chr> "Bright, Modern Garden Un~
$ description <chr> "Please check local laws ~
$ neighborhood_overview <chr> "Quiet cul de sac in frie~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 1169, 8904, 21994, 21994,~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Holly", "Philip And Tani~
$ host_since <date> 2008-07-31, 2009-03-02, ~
$ host_location <chr> "San Francisco, Californi~
$ host_about <chr> "We are a family of four ~
$ host_response_time <chr> "within an hour", "N/A", ~
$ host_response_rate <chr> "100%", "N/A", "100%", "1~
$ host_acceptance_rate <chr> "92%", "N/A", "100%", "10~
$ host_is_superhost <lgl> TRUE, FALSE, FALSE, FALSE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Duboce Triangle", "Berna~
$ host_listings_count <dbl> 1, 2, 10, 10, 2, 2, 1, 0,~
$ host_total_listings_count <dbl> 1, 2, 10, 10, 2, 2, 1, 0,~
$ host_verifications <chr> "['email', 'phone', 'face~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood <chr> "San Francisco, Californi~
$ neighbourhood_cleansed <chr> "Western Addition", "Bern~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude <dbl> 37.77028, 37.74474, 37.76~
$ longitude <dbl> -122.4332, -122.4209, -12~
$ property_type <chr> "Entire serviced apartmen~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 3, 5, 2, 2, 4, 3, 4, 2, 3~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "4 sh~
$ bedrooms <dbl> 1, 2, 1, 1, 2, 1, 2, NA, ~
$ beds <dbl> 2, 3, 1, 1, 2, 1, 3, 1, 3~
$ amenities <chr> "[\"Keypad\", \"Refrigera~
$ price <chr> "$160.00", "$235.00", "$5~
$ minimum_nights <dbl> 2, 30, 32, 32, 7, 13, 30,~
$ maximum_nights <dbl> 30, 60, 60, 90, 111, 14, ~
$ minimum_minimum_nights <dbl> 2, 30, 32, 32, 7, 13, 30,~
$ maximum_minimum_nights <dbl> 2, 30, 32, 32, 7, 13, 30,~
$ minimum_maximum_nights <dbl> 1125, 60, 60, 90, 111, 14~
$ maximum_maximum_nights <dbl> 1125, 60, 60, 90, 111, 14~
$ minimum_nights_avg_ntm <dbl> 2, 30, 32, 32, 7, 13, 30,~
$ maximum_nights_avg_ntm <dbl> 1125, 60, 60, 90, 111, 14~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 6, 30, 30, 11, 30, 23, 4,~
$ availability_60 <dbl> 12, 60, 60, 41, 60, 47, 2~
$ availability_90 <dbl> 18, 90, 90, 71, 90, 77, 5~
$ availability_365 <dbl> 104, 365, 365, 346, 365, ~
$ calendar_last_scraped <date> 2021-10-06, 2021-10-06, ~
$ number_of_reviews <dbl> 302, 111, 19, 8, 28, 736,~
$ number_of_reviews_ltm <dbl> 40, 0, 0, 0, 0, 1, 2, 0, ~
$ number_of_reviews_l30d <dbl> 5, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2014-10-05, 2009-11-24, ~
$ last_review <date> 2021-09-17, 2015-08-28, ~
$ review_scores_rating <dbl> 4.87, 4.88, 4.20, 4.63, 4~
$ review_scores_accuracy <dbl> 4.94, 4.85, 3.73, 4.38, 4~
$ review_scores_cleanliness <dbl> 4.95, 4.87, 3.87, 4.38, 5~
$ review_scores_checkin <dbl> 4.96, 4.89, 4.67, 4.75, 4~
$ review_scores_communication <dbl> 4.90, 4.85, 4.60, 4.75, 5~
$ review_scores_location <dbl> 4.98, 4.77, 4.73, 4.63, 4~
$ review_scores_value <dbl> 4.78, 4.68, 4.00, 4.63, 4~
$ license <chr> "City Registration Pendin~
$ instant_bookable <lgl> FALSE, FALSE, FALSE, FALS~
$ calculated_host_listings_count <dbl> 1, 1, 9, 9, 2, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes <dbl> 1, 1, 0, 0, 2, 0, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 9, 9, 0, 2, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 3.54, 0.77, 0.17, 0.10, 0~
skim(listings)
| Name | listings |
| Number of rows | 6566 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 24 |
| Date | 5 |
| logical | 8 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 32 | 37 | 0 | 6566 | 0 |
| name | 0 | 1.00 | 2 | 94 | 0 | 6190 | 0 |
| description | 75 | 0.99 | 14 | 1000 | 0 | 5850 | 0 |
| neighborhood_overview | 1777 | 0.73 | 9 | 1000 | 0 | 3642 | 0 |
| picture_url | 0 | 1.00 | 60 | 126 | 0 | 6268 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 3402 | 0 |
| host_name | 14 | 1.00 | 1 | 42 | 0 | 1850 | 0 |
| host_location | 20 | 1.00 | 2 | 62 | 0 | 236 | 0 |
| host_about | 1932 | 0.71 | 1 | 3409 | 0 | 2356 | 3 |
| host_response_time | 14 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 14 | 1.00 | 2 | 4 | 0 | 48 | 0 |
| host_acceptance_rate | 14 | 1.00 | 2 | 4 | 0 | 92 | 0 |
| host_thumbnail_url | 14 | 1.00 | 55 | 106 | 0 | 3393 | 0 |
| host_picture_url | 14 | 1.00 | 57 | 109 | 0 | 3393 | 0 |
| host_neighbourhood | 418 | 0.94 | 3 | 31 | 0 | 162 | 0 |
| host_verifications | 0 | 1.00 | 4 | 152 | 0 | 248 | 0 |
| neighbourhood | 1777 | 0.73 | 28 | 54 | 0 | 6 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 6 | 21 | 0 | 36 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 52 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 10 | 1.00 | 6 | 17 | 0 | 30 | 0 |
| amenities | 0 | 1.00 | 27 | 1746 | 0 | 5585 | 0 |
| price | 0 | 1.00 | 5 | 10 | 0 | 593 | 0 |
| license | 2735 | 0.58 | 3 | 426 | 0 | 1648 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-10-06 | 2021-10-06 | 2021-10-06 | 1 |
| host_since | 14 | 1.00 | 2008-07-31 | 2021-09-28 | 2015-02-02 | 2127 |
| calendar_last_scraped | 0 | 1.00 | 2021-10-06 | 2021-10-06 | 2021-10-06 | 1 |
| first_review | 1397 | 0.79 | 2009-09-25 | 2021-10-04 | 2018-12-26 | 2181 |
| last_review | 1397 | 0.79 | 2010-10-04 | 2021-10-05 | 2021-07-10 | 1090 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 14 | 1 | 0.44 | FAL: 3670, TRU: 2882 |
| host_has_profile_pic | 14 | 1 | 0.99 | TRU: 6490, FAL: 62 |
| host_identity_verified | 14 | 1 | 0.85 | TRU: 5592, FAL: 960 |
| neighbourhood_group_cleansed | 6566 | 0 | NaN | : |
| bathrooms | 6566 | 0 | NaN | : |
| calendar_updated | 6566 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 6487, FAL: 79 |
| instant_bookable | 0 | 1 | 0.36 | FAL: 4205, TRU: 2361 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.714515e+07 | 16412399.00 | 9.580000e+02 | 1.289131e+07 | 2.836578e+07 | 4.156947e+07 | 5.263301e+07 | ▇▅▆▇▇ |
| scrape_id | 0 | 1.00 | 2.021101e+13 | 0.00 | 2.021101e+13 | 2.021101e+13 | 2.021101e+13 | 2.021101e+13 | 2.021101e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 8.383933e+07 | 109872537.10 | 1.169000e+03 | 4.562696e+06 | 2.648276e+07 | 1.225316e+08 | 4.250843e+08 | ▇▂▁▁▁ |
| host_listings_count | 14 | 1.00 | 7.264000e+01 | 318.97 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.200000e+01 | 1.987000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 14 | 1.00 | 7.264000e+01 | 318.97 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.200000e+01 | 1.987000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 3.777000e+01 | 0.02 | 3.771000e+01 | 3.775000e+01 | 3.777000e+01 | 3.779000e+01 | 3.781000e+01 | ▂▃▆▇▅ |
| longitude | 0 | 1.00 | -1.224300e+02 | 0.03 | -1.225100e+02 | -1.224400e+02 | -1.224200e+02 | -1.224100e+02 | -1.223700e+02 | ▁▂▅▇▁ |
| accommodates | 0 | 1.00 | 3.090000e+00 | 1.83 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▅▁▁▁ |
| bedrooms | 933 | 0.86 | 1.510000e+00 | 0.86 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 9.000000e+00 | ▇▁▁▁▁ |
| beds | 66 | 0.99 | 1.720000e+00 | 1.22 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.400000e+01 | ▇▂▁▁▁ |
| minimum_nights | 0 | 1.00 | 2.327000e+01 | 49.32 | 1.000000e+00 | 2.000000e+00 | 3.000000e+01 | 3.000000e+01 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 4.935600e+02 | 541.98 | 1.000000e+00 | 2.900000e+01 | 1.800000e+02 | 1.125000e+03 | 1.000000e+04 | ▇▁▁▁▁ |
| minimum_minimum_nights | 2 | 1.00 | 2.410000e+01 | 55.20 | 1.000000e+00 | 2.000000e+00 | 3.000000e+01 | 3.000000e+01 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 2 | 1.00 | 3.962000e+01 | 116.87 | 1.000000e+00 | 2.000000e+00 | 3.000000e+01 | 3.000000e+01 | 1.125000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 2 | 1.00 | 6.874500e+02 | 548.24 | 1.000000e+00 | 7.000000e+01 | 1.125000e+03 | 1.125000e+03 | 1.000000e+04 | ▇▁▁▁▁ |
| maximum_maximum_nights | 2 | 1.00 | 7.525390e+06 | 126905437.12 | 1.000000e+00 | 9.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 2 | 1.00 | 3.896000e+01 | 113.93 | 1.000000e+00 | 2.000000e+00 | 3.000000e+01 | 3.000000e+01 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 2 | 1.00 | 7.508364e+06 | 126618320.95 | 1.000000e+00 | 9.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.142625e+09 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 8.960000e+00 | 11.09 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.700000e+01 | 3.000000e+01 | ▇▁▁▁▂ |
| availability_60 | 0 | 1.00 | 2.278000e+01 | 22.77 | 0.000000e+00 | 0.000000e+00 | 1.800000e+01 | 4.300000e+01 | 6.000000e+01 | ▇▂▂▂▃ |
| availability_90 | 0 | 1.00 | 3.915000e+01 | 34.13 | 0.000000e+00 | 0.000000e+00 | 3.600000e+01 | 7.000000e+01 | 9.000000e+01 | ▇▂▂▃▅ |
| availability_365 | 0 | 1.00 | 1.606500e+02 | 134.11 | 0.000000e+00 | 2.200000e+01 | 1.420000e+02 | 3.000000e+02 | 3.650000e+02 | ▇▃▂▂▆ |
| number_of_reviews | 0 | 1.00 | 4.422000e+01 | 84.37 | 0.000000e+00 | 1.000000e+00 | 7.000000e+00 | 4.600000e+01 | 8.610000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 6.140000e+00 | 15.27 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 | 3.730000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 7.200000e-01 | 2.08 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 5.100000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 1397 | 0.79 | 4.730000e+00 | 0.56 | 0.000000e+00 | 4.710000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 1429 | 0.78 | 4.820000e+00 | 0.40 | 0.000000e+00 | 4.800000e+00 | 4.940000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 1429 | 0.78 | 4.760000e+00 | 0.43 | 0.000000e+00 | 4.710000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 1430 | 0.78 | 4.880000e+00 | 0.32 | 0.000000e+00 | 4.890000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 1429 | 0.78 | 4.860000e+00 | 0.37 | 1.000000e+00 | 4.880000e+00 | 4.980000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 1430 | 0.78 | 4.800000e+00 | 0.39 | 0.000000e+00 | 4.770000e+00 | 4.910000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 1430 | 0.78 | 4.660000e+00 | 0.45 | 0.000000e+00 | 4.580000e+00 | 4.760000e+00 | 4.900000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 1.510000e+01 | 32.60 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.000000e+01 | 1.510000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 1.071000e+01 | 31.45 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.510000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 3.760000e+00 | 9.93 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 5.600000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 4.200000e-01 | 2.87 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.600000e+01 | ▇▁▁▁▁ |
| reviews_per_month | 1397 | 0.79 | 1.940000e+00 | 5.20 | 1.000000e-02 | 2.200000e-01 | 6.900000e-01 | 2.100000e+00 | 1.260000e+02 | ▇▁▁▁▁ |
Comment: The dataset for San Francisco’s Airbnb listings has 74 variables with 6,566 rows. Of these variables, 37 are numeric, 24 are character, 8 are logical, and 5 are date. Of course, not all of these variables are integral to the creation of a price model, and we need to be able to separate the signal from the noise by tidying the data.
In our dataset, categorical or ‘factor’ variables include the 8 logical variables which only have values TRUE or FALSE, and certain character variables such as host_neighbourhood and neighbourhood_cleansed. Understanding which variables have a fixed and known set of possible values is important because it allows us to focus and restrict our analysis. In the case of categorical variables which are numeric, it also allows us to use linear instead of logarithmic scales.
In particular, we have decided on some specific variables of interest to conduct our regression analysis. They are: accommodates, bedrooms, beds, number_of_reviews, review_scores_rating, review_scores_value, minimum_nights, and maximum_nights.
favstats(listings$accommodates)
min Q1 median Q3 max mean sd n missing
0 2 2 4 16 3.094883 1.833757 6566 0
favstats(listings$bedrooms)
min Q1 median Q3 max mean sd n missing
1 1 1 2 9 1.514291 0.857692 5633 933
favstats(listings$beds)
min Q1 median Q3 max mean sd n missing
0 1 1 2 14 1.723538 1.223217 6500 66
favstats(listings$number_of_reviews)
min Q1 median Q3 max mean sd n missing
0 1 7 46 861 44.22373 84.36606 6566 0
favstats(listings$review_scores_rating)
min Q1 median Q3 max mean sd n missing
0 4.71 4.89 5 5 4.73348 0.5575404 5169 1397
favstats(listings$review_scores_value)
min Q1 median Q3 max mean sd n missing
0 4.58 4.76 4.9 5 4.66134 0.4450693 5136 1430
favstats(listings$minimum_nights)
min Q1 median Q3 max mean sd n missing
1 2 30 30 1125 23.26592 49.32259 6566 0
favstats(listings$maximum_nights)
min Q1 median Q3 max mean sd n missing
1 29 180 1125 10000 493.5577 541.9797 6566 0
# Drop rows with a true NA in any of the variables we intend to use
listings_1 <- listings %>%
  drop_na(c(host_is_superhost, host_has_profile_pic, host_identity_verified, instant_bookable, # logical variables
            bedrooms, beds, review_scores_rating, reviews_per_month, # numerical variables
            host_response_time, host_response_rate, host_acceptance_rate, bathrooms_text)) # character variables

# Some character columns still contain the string "N/A" (not a true NA), so we drop those as well
na <- c(listings_1$host_response_time, listings_1$host_response_rate, listings_1$host_acceptance_rate)
# grep() returns positions in the concatenated vector `na`; only positions up to
# nrow(listings_1), i.e. those from host_response_time, correspond to row numbers
n <- grep("N/A", na)
listings_2 <- listings_1[-n, ] %>%
  # Convert "$160.00" and "92%" style strings to numbers
  mutate(price = parse_number(price),
         host_response_rate = parse_number(host_response_rate),
         host_acceptance_rate = parse_number(host_acceptance_rate))
         # host_is_superhost = factor(host_is_superhost, levels = c("TRUE","FALSE")),
         # host_identity_verified = factor(host_identity_verified, levels = c("TRUE","FALSE")))
listings_2
# A tibble: 3,467 x 74
id listing_url scrape_id last_scraped name description neighborhood_ov~
<dbl> <chr> <dbl> <date> <chr> <chr> <chr>
1 958 https://www~ 2.02e13 2021-10-06 Brigh~ "Please ch~ Quiet cul de sa~
2 7918 https://www~ 2.02e13 2021-10-06 A Fri~ "Nice and ~ Shopping old to~
3 8142 https://www~ 2.02e13 2021-10-06 Frien~ "Nice and ~ <NA>
4 8339 https://www~ 2.02e13 2021-10-06 Histo~ "Pls email~ <NA>
5 8739 https://www~ 2.02e13 2021-10-06 Missi~ "Welcome t~ Located between~
6 10820 https://www~ 2.02e13 2021-10-06 Haigh~ "This prop~ Neighborhood: H~
7 10824 https://www~ 2.02e13 2021-10-06 Victo~ "This prop~ Neighborhood: H~
8 10832 https://www~ 2.02e13 2021-10-06 Union~ "This prop~ Neighborhood: D~
9 12041 https://www~ 2.02e13 2021-10-06 Sunny~ "Nice and ~ Small shopping ~
10 12042 https://www~ 2.02e13 2021-10-06 Sunny~ "Settle do~ <NA>
# ... with 3,457 more rows, and 67 more variables: picture_url <chr>,
# host_id <dbl>, host_url <chr>, host_name <chr>, host_since <date>,
# host_location <chr>, host_about <chr>, host_response_time <chr>,
# host_response_rate <dbl>, host_acceptance_rate <dbl>,
# host_is_superhost <lgl>, host_thumbnail_url <chr>, host_picture_url <chr>,
# host_neighbourhood <chr>, host_listings_count <dbl>,
# host_total_listings_count <dbl>, host_verifications <chr>, ...
typeof(listings$price)
[1] "character"
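The cleaning step relies on two details: price and the two rate columns are stored as text, and the placeholder "N/A" is a string rather than a true NA. Here is a minimal sketch on a made-up four-row data frame (the `toy` values and the names `stacked_idx` and `toy_clean` are invented for illustration and are not part of the dataset). It shows that grep() on several concatenated columns returns positions in the combined vector, which can exceed the number of rows, and that a row-wise filter() is one way to drop every row containing "N/A" before parsing the numbers.

```r
library(dplyr)
library(readr)

# Hypothetical toy data mimicking the raw column formats
toy <- data.frame(
  price = c("$160.00", "$235.00", "$1,250.00", "$56.00"),
  host_response_time = c("within an hour", "N/A", "within a day", "within an hour"),
  host_response_rate = c("100%", "N/A", "90%", "100%"),
  host_acceptance_rate = c("92%", "N/A", "N/A", "100%"),
  stringsAsFactors = FALSE
)

# grep() on concatenated columns indexes the 12-element combined vector,
# so some positions are larger than nrow(toy) and do not identify a row
stacked <- c(toy$host_response_time, toy$host_response_rate, toy$host_acceptance_rate)
stacked_idx <- grep("N/A", stacked)

# Row-wise alternative: keep rows where none of the three columns is "N/A",
# then turn the "$..." and "...%" strings into numbers
toy_clean <- toy %>%
  filter(host_response_time != "N/A",
         host_response_rate != "N/A",
         host_acceptance_rate != "N/A") %>%
  mutate(price = parse_number(price),
         host_response_rate = parse_number(host_response_rate),
         host_acceptance_rate = parse_number(host_acceptance_rate))

toy_clean
```

In this sketch only the first and fourth rows survive. On the real data, a row-wise filter of this kind may keep a different number of rows than positional indexing does, so any downstream counts would need to be re-checked.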
density_price_plot <- ggplot(listings_2, aes(x=price))+
geom_density()+
theme_bw()+
labs(title = "Price Distribution",
subtitle = "Density Plot",
x = "Price",
y = "Density")+
NULL
density_price_plot
Comment: The density plot of price is heavily skewed to the right: most Airbnb prices in San Francisco hover around a similar, fairly low range, with a long tail of much more expensive listings. The clustering makes sense, as Airbnb only offers rental services, so most prices should not differ that drastically from one another. In the next graph, we plot the same distribution on a logarithmic scale instead.
density_log_price <- ggplot(listings_2, aes(x=price)) +
geom_density()+
theme_bw()+
scale_x_log10()+
labs(title = "Price Distribution",
subtitle = "Density Plot",
x = "Log (Price) ",
y = "Density")+
NULL
density_log_price
Comment: To better visualise this data, we can use a logarithmic scale, which makes the distribution look much closer to a normal distribution, although it is still somewhat skewed to the right (positively skewed). The remaining skew arises because prices are bounded from below, with few listings close to 50 or cheaper, while there is no upper limit on how expensive a listing can be.
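The effect of the log transformation on skewness can be illustrated with simulated data. This is a minimal sketch that assumes prices are roughly log-normal; the simulated `price` vector and the helper `skewness()` are hypothetical and are not computed from the listings data.

```r
set.seed(1)

# Simulated prices: log-normal, so log(price) is normal by construction
price <- rlnorm(1000, meanlog = 5, sdlog = 0.7)

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price)       # clearly positive: long right tail
skew_log <- skewness(log(price))  # close to zero: roughly symmetric

c(raw = skew_raw, log = skew_log)
```

A skewness near zero on the log scale is one informal reason to model log(price) rather than price in a linear regression.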
availability_price <- ggplot(listings_2, aes(x = availability_30, y = log(price)))+
geom_col()+
theme_bw()+
labs(title = "Availability for 30 days vs Price",
subtitle = "Bar Chart",
x = "Rooms available within 30 days",
y = "Log (Price)")+
NULL
availability_price
host_response_density <- ggplot(listings_2,aes(x=host_response_rate))+
geom_density()+
theme_bw()+
labs(title = "Host Response Rate",
subtitle = "Density Plot",
x = "Response rate",
y = "Density")+
NULL
host_response_density
host_acceptance_density <- ggplot(listings_2,
aes(x=host_acceptance_rate))+
geom_density()+
theme_bw()+
labs(title = "Host Acceptance Rate",
subtitle = "Density Plot",
x = "Host acceptance rate",
y = "Density")+
NULL
host_acceptance_density
number_of_reviews_density <- ggplot(listings_2,aes(x=number_of_reviews))+
geom_density()+
theme_bw()+
labs(title = "Number of Reviews",
subtitle = "Density Plot",
x = "Number of reviews",
y = "Density")+
NULL
number_of_reviews_density
rating_density <- ggplot(listings_2,aes(x=review_scores_rating))+
geom_density()+
theme_bw()+
labs(title = "Review Rating",
subtitle = "Density Plot",
x = "Rating",
y = "Density")+
NULL
rating_density
Comment: Availability for 30 days vs Price: availability_30 records how many of the next 30 days a listing is available, and geom_col() stacks log(price) for every listing that shares a value. The bars are tallest at the low end of the axis, with a second spike at 30 days. Because the bars are sums, their height reflects how many listings sit at each value as much as how expensive those listings are, so the chart alone does not establish that these listings charge more. A supply-and-demand reading remains plausible: when few nights are left, supply is scarce and guests who need a room have little other choice, so hosts can set higher prices, while the spike at 30 may reflect listings opened to organised guests who book far in advance. Confirming this would require comparing an average or median price per availability value.
Host Response Rate Density Plot: The density plot for host response rates shows that the vast majority of host response rates are between 90% and 100%. If they were much less than 90%, it is unlikely that anyone using Airbnb would think their properties to be reliable enough to pay for them.
Number of Reviews Density Plot: The number of reviews density plot is heavily skewed to the right. Most properties have only a handful of reviews (in the raw data the median is 7 and three quarters of listings have 46 or fewer), while a few have several hundred, which stretches the axis beyond 750. Writing reviews takes time, and most consumers do not bother unless they feel very passionately about their experience.
Review Rating Density Plot: Most ratings for the Airbnb properties are clustered around the 4.5-5 mark. If they were any less than this, most consumers would probably simply avoid them. Moreover, it is also possible that Airbnb has policies around taking off any listings below a certain number, maybe around 4, in the same way that Uber requires all of its drivers to be above a certain threshold to be able to continue driving for them.
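When reading the bar charts in this section, it helps to separate the height of a stacked bar from the typical value within a group. The following is a minimal sketch on invented data (the `toy` values and the summary column names are hypothetical) that contrasts the sum geom_col() would draw with a per-group median.

```r
library(dplyr)

# Hypothetical listings: many cheap ones at 0 available days, a few pricier ones at 30
toy <- data.frame(
  availability_30 = c(0, 0, 0, 0, 15, 30, 30),
  price = c(100, 120, 110, 90, 150, 200, 220)
)

by_availability <- toy %>%
  group_by(availability_30) %>%
  summarise(
    n_listings = n(),
    stacked_log_price = sum(log(price)), # total height of the stacked bar
    median_price = median(price)         # typical price within the group
  )

by_availability
```

In this toy example the group with 0 available days has the tallest stacked bar simply because it contains the most listings, even though its median price is the lowest. A box plot or a summary like the one above is one way to compare price levels directly.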
superhost_price <- ggplot(listings_2, aes(x=log(price), y=host_is_superhost, fill= host_is_superhost))+
geom_boxplot()+
theme_bw()+
theme(legend.position = "none")+
labs(title = " Relationship between Superhost and Price ",
subtitle = "Box Plot",
x = "Log(Price)",
y = "Superhost")+
NULL
superhost_price
superhost_price_density <- listings_2 %>%
ggplot(aes(x=log(price), color= host_is_superhost))+
geom_density()+
facet_wrap(~host_is_superhost)+
theme_bw()+
theme(legend.position = "none")+
labs(title = "Relationship between Superhost and Price",
subtitle = "Density Plots",
x = "Log(Price)",
y = "Density")+
NULL
superhost_price_density
superhost_reviews <- ggplot(listings_2, aes(x=number_of_reviews, y=host_is_superhost, fill= host_is_superhost))+
geom_col()+
theme_bw()+
theme(legend.position = "none")+
labs(title = " Relationship between Superhost and reviews",
subtitle = "Bar Chart",
x = "Number of Reviews",
y = "Superhost")+
NULL
superhost_reviews
superhost_rating <- ggplot(listings_2, aes(x=review_scores_rating, y=host_is_superhost, fill= host_is_superhost))+
geom_col()+
theme_bw()+
theme(legend.position = "none")+
labs(title = " Relationship between Superhost and ratings",
subtitle = "Bar Chart",
x = "Ratings",
y = "Superhost")+
NULL
superhost_rating
Comment: Relationship between Superhost and Price: Somewhat surprisingly, superhosts and non-superhosts set roughly similar prices, and both the box plot and the density plots show that the two price distributions are also very similar. This could suggest that superhost status does not have a significant impact on what customers are willing to pay. However, the relationships between superhost status and reviews and ratings indicate that being a superhost does make a difference in other respects. One possible reason that superhosts and non-superhosts can still offer properties at the same prices is that some customers are more sensitive to other factors such as location.
Relationship between Superhost and Reviews: Superhost listings have accumulated many more reviews in total than non-superhost listings, with more than 150,000 compared to just over 50,000. If we look into Airbnb guidelines, this is consistent with the very high response rate required of Airbnb superhosts. Each superhost must maintain a response rate of 90% or higher, which may incentivise more people to leave reviews if they feel almost certain to receive a response.
Relationship between Superhost and Ratings: The summed rating scores are also higher for superhosts than for non-superhosts. Because the bar chart adds up the scores of all listings in each group, this total reflects the number of rated listings as well as how highly each one is rated. Again, the difference can be attributed to the closer relationship superhosts must strive to maintain with their guests in order to preserve their superhost designation.
host_verified_price <- ggplot(listings_2, aes(x=log(price), y=host_identity_verified, fill= host_identity_verified))+
geom_boxplot()+
theme_bw()+
theme(legend.position = "none")+
labs(title = "Relationship between a verified host and price",
subtitle = "Boxplot",
x = "Log(Price)",
y = "Verified Host")+
NULL
host_verified_price
host_verified_price_density <- ggplot(listings_2, aes(x=log(price), colour = host_identity_verified))+
geom_density()+
facet_wrap(~host_identity_verified)+
theme_bw()+
theme(legend.position = "none")+
labs(title = "Relationship between a verified host and price ",
subtitle = "Density Plot",
x = "Log(Price)",
y = "Density")+
NULL
host_verified_price_density
host_verified_reviews <- ggplot(listings_2, aes(x=number_of_reviews, y=host_identity_verified, fill= host_identity_verified))+
geom_col()+
theme_bw()+
theme(legend.position = "none")+
labs(title = "Relationship between a verified host and reviews ",
subtitle = "Bar Chart",
x = "Number of Reviews",
y = "Verified Host")+
NULL
host_verified_reviews
host_verified_rating <- ggplot(listings_2, aes(x=review_scores_rating, y=host_identity_verified, fill= host_identity_verified))+
geom_col()+
theme_bw()+
theme(legend.position = "none")+
labs(title = "Relationship between a verified host and ratings",
subtitle = "Bar Chart",
x = "Ratings",
y = "Verified Host")+
NULL
host_verified_rating
Comment: Relationship between a verified host and price: Unlike the relationship between superhosts and price, there is a marked difference between the prices a verified host and a non-verified host can charge. This can be explained by the stricter requirements for becoming a verified host on Airbnb. To become a verified host, one needs to provide Airbnb with government ID, whereas Superhosts only need to have fulfilled requirements to do with high response rates, low cancellation rates, and high ratings.
Relationship between a verified host and reviews: Verified hosts receive a much higher volume of reviews than non-verified hosts. This may be attributed to the fact that some verified hosts also require their guests to be verified, in which case they may be avid Airbnb users and be more likely to contribute regular reviews to the properties they stay in.
Relationship between a verified host and ratings: Consistent with the relationship between a verified host and the number of reviews, the cumulative ratings for a verified host are much higher than for a non-verified host. The relationship between a verified host and ratings is very much comparable to that between a verified host and reviews.
Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?
Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.
listings_2 %>%
  group_by(property_type) %>%
  # n() counts the rows in each group
  summarise(num_property_type = n()) %>%
  arrange(desc(num_property_type))
# A tibble: 43 x 2
property_type num_property_type
<chr> <int>
1 Entire rental unit 867
2 Private room in residential home 527
3 Entire residential home 486
4 Entire condominium (condo) 334
5 Entire guest suite 283
6 Private room in rental unit 246
7 Room in boutique hotel 151
8 Private room in condominium (condo) 99
9 Room in hotel 59
10 Entire serviced apartment 52
# ... with 33 more rows
listings_3 <- listings_2 %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit", "Private room in residential home","Entire residential home","Entire condominium (condo)") ~ property_type,
TRUE ~ "Other"))
Comment: The top 4 most common property types are, from most to least common: entire rental unit, private room in residential home, entire residential home, and entire condominium (condo). Together they make up around 64% of all property listings (2,214 of 3,467).
listings_3 %>%
count(prop_type_simplified) %>%
arrange(desc(n))
# A tibble: 5 x 2
prop_type_simplified n
<chr> <int>
1 Other 1253
2 Entire rental unit 867
3 Private room in residential home 527
4 Entire residential home 486
5 Entire condominium (condo) 334
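The share of listings covered by the top four property types can be checked directly from the counts printed above. A minimal sketch, with the counts typed in by hand from that table (the vector name `counts` is just for illustration):

```r
# Counts copied from the prop_type_simplified table above
counts <- c(
  "Other" = 1253,
  "Entire rental unit" = 867,
  "Private room in residential home" = 527,
  "Entire residential home" = 486,
  "Entire condominium (condo)" = 334
)

# Listings in the four named categories, and their share of all listings
top4_total <- sum(counts[names(counts) != "Other"])
top4_share <- top4_total / sum(counts)

round(top4_share, 3)
```

This gives 2,214 of 3,467 listings, or about 64%.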
log_price_prop <- ggplot(listings_3, aes(x=log(price), color= prop_type_simplified)) +
geom_density()+
theme_bw()+
facet_wrap(~prop_type_simplified, nrow=1)+
labs(title = "Price distribution for different property types",
subtitle = "Density Plot",
x = "Log (Price) ",
y = "Density")+
theme(legend.position = "none")+
NULL
log_price_prop
Comment: The prices for entire rental units and private rooms in residential homes are especially closely packed together. This may be because these 2 property types are the most common and have the highest number of listings, so many comparable properties are available at relatively similar prices.
data_for_correlation <-listings_3 %>%
select(availability_30, bedrooms, beds, host_listings_count, number_of_reviews_l30d, review_scores_rating)
correlation_matrix <- cor(data_for_correlation)
ggcorrplot(correlation_matrix, hc.order= TRUE, lab = TRUE, colors= c("#CB454A", "white", "#7DCD85"))+
labs(title = "Correlation Matrix")
ggpairs(data_for_correlation)+
theme_bw()+
labs(title = "Relationship between selected variables",
subtitle = "Correlation with scatter and density plots")
Comment: The most noteworthy feature of the correlation matrix and the pairs plot is the relatively high correlation between beds and bedrooms (0.756). This is expected: in general, the more bedrooms a property has, the more beds it has as well. Including both in our regression analysis may introduce some multicollinearity, which inflates the standard errors of their coefficients and makes statistical inference about each one less reliable. The correlations among the other selected variables are all relatively low. The largest of them is between availability_30 and review_scores_rating (-0.168). One way to rationalise it is that a property which stays available for longer faces weaker demand and may not be rated as highly. Nevertheless, this correlation is weak, so we should not worry about it too much.
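To gauge how much a correlation of 0.756 matters, one common yardstick is the variance inflation factor (VIF). The following is a minimal sketch under a simplifying assumption: in a model whose only two predictors are beds and bedrooms, the R-squared from regressing one on the other equals the squared correlation, so VIF = 1 / (1 - r^2). With more predictors in the model the actual VIF could be higher.

```r
# Correlation between beds and bedrooms reported above
r <- 0.756

# With only two predictors, R-squared of one on the other is r^2
r_squared <- r^2

# Variance inflation factor: how much the coefficient variance is inflated
vif <- 1 / (1 - r_squared)

round(vif, 2)
```

A value of roughly 2.3 means the coefficient variances are a little more than doubled relative to uncorrelated predictors, which is noticeable but well below the thresholds of 5 or 10 that are often used as rules of thumb.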
Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualizations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualization we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.
Starting from a data frame of all Airbnb listings (here listings_3, for San Francisco), the following code plots on the map all Airbnbs where minimum_nights is less than or equal to four (4). You can learn more about leaflet by following the relevant Datacamp course on mapping with leaflet.
leaflet(data = filter(listings_3, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.
Create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?
Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
Interpret the coefficient of review_scores_rating in terms of price_4_nights.
Interpret the coefficient of prop_type_simplified in terms of price_4_nights.
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
listings_model <- listings_3 %>%
filter(accommodates>=2, minimum_nights<=4) %>%
mutate(price_4_nights = price * 4) %>%
select(76,34,37,38,36,36,41,42,51,56,58,74,61,67,16,17,18,22,23,26,69,28,32,33,75)
colnames(listings_model)
[1] "price_4_nights" "accommodates"
[3] "bedrooms" "beds"
[5] "bathrooms_text" "minimum_nights"
[7] "maximum_nights" "availability_30"
[9] "number_of_reviews" "number_of_reviews_l30d"
[11] "reviews_per_month" "review_scores_rating"
[13] "review_scores_value" "host_response_rate"
[15] "host_acceptance_rate" "host_is_superhost"
[17] "host_listings_count" "host_total_listings_count"
[19] "host_identity_verified" "instant_bookable"
[21] "neighbourhood_cleansed" "property_type"
[23] "room_type" "prop_type_simplified"
favstats(listings_model$price_4_nights)
min Q1 median Q3 max mean sd n missing
140 472 704 1180 1e+05 1106.583 2688.563 1685 0
listings_model_1 <- listings_model %>%
filter(price_4_nights <= 1106.58+2688.56)
ggplot(listings_model_1,aes(x = price_4_nights)) +
geom_density()+
labs(title = "Price distribution for four nights",
subtitle = "Density Plot",
x = "Price for 4 nights",
y = "Density")+
NULL
# the density plot without log is quite right-skewed, so we choose to log our Y.
ggplot(listings_model_1,aes(x = log(price_4_nights))) +
geom_density()+
labs(title = "Price distribution for four nights",
subtitle = "Density Plot",
x = "Log(Price for 4 nights)",
y = "Density")+
NULL
# it looks better now
model1 <- lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data = listings_model_1)
summary(model1)
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = listings_model_1)
Residuals:
Min 1Q Median 3Q Max
-1.60967 -0.33077 -0.00579 0.31766 1.84554
Coefficients:
Estimate Std. Error
(Intercept) 6.7545780 0.1691724
prop_type_simplifiedEntire rental unit -0.1905313 0.0559638
prop_type_simplifiedEntire residential home 0.0148465 0.0552276
prop_type_simplifiedOther -0.6576207 0.0498154
prop_type_simplifiedPrivate room in residential home -1.0489163 0.0547899
number_of_reviews -0.0011476 0.0001068
review_scores_rating 0.0913308 0.0332170
t value Pr(>|t|)
(Intercept) 39.927 < 2e-16 ***
prop_type_simplifiedEntire rental unit -3.405 0.000679 ***
prop_type_simplifiedEntire residential home 0.269 0.788099
prop_type_simplifiedOther -13.201 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -19.144 < 2e-16 ***
number_of_reviews -10.748 < 2e-16 ***
review_scores_rating 2.750 0.006034 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4936 on 1624 degrees of freedom
Multiple R-squared: 0.4186, Adjusted R-squared: 0.4165
F-statistic: 194.9 on 6 and 1624 DF, p-value: < 2.2e-16
# When a categorical variable has k levels, we include (k-1) dummy variables in the regression model; the level left out acts as our baseline (or zero).
# In this case, "Entire condominium (condo)" is the baseline, and the intercept 6.755 is the expected log price of a condo when number_of_reviews and review_scores_rating are both zero.
# The coefficient of Entire rental unit is -0.191: holding the other variables fixed, an entire rental unit has a log price 0.191 lower than a condo, i.e. it is about 17% cheaper.
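Because the response is log(price_4_nights), each coefficient is easier to read once converted to a percentage change in price. A small sketch, using coefficients copied from the model1 summary above (the helper name `pct_change` is ours, introduced only for illustration):

```r
# Coefficients copied from the model1 summary output
b_rental_unit  <- -0.1905313   # Entire rental unit vs. the baseline property type
b_private_room <- -1.0489163   # Private room in residential home vs. the baseline
b_rating       <-  0.0913308   # one extra point of review_scores_rating

# In a log-linear model, exp(b) - 1 is the proportional change in price
# associated with a one-unit change in the predictor
pct_change <- function(b) 100 * (exp(b) - 1)

pct <- sapply(c(rental_unit = b_rental_unit,
                private_room = b_private_room,
                rating = b_rating), pct_change)
round(pct, 1)
```

For small coefficients the percentage change is close to 100 times the coefficient, but for large ones (such as the private room effect) the two differ noticeably, so the exponentiated form is the safer one to report.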
autoplot(model1)+
theme_bw()
# to check the residuals
# there is a pattern in the top left graph, suggesting that some variables that explain Y are not yet in our model.
car::vif(model1)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.084174 4 1.010153
number_of_reviews 1.042724 1 1.021139
review_scores_rating 1.047398 1 1.023425
Comment: Looking at the Residuals vs Fitted graph, for fitted values between 5.5 and 6.5 the residuals bounce around the 0 line randomly and show no positive or negative relationship with the fitted values. This suggests that the errors have mean zero. The Normal Q-Q graph is close to a straight line with positive slope, which suggests that the residuals are approximately normally distributed; this is likely helped by the log transformation of the price. In the Scale-Location graph, the blue line is approximately horizontal, indicating that the average magnitude of the residuals does not change much with the fitted values. However, the spread around the blue line widens as fitted values increase up to 6.5, then narrows and widens again up to 7.5, which may be an indication of heteroskedasticity. Finally, the Residuals vs Leverage graph shows some observations that may have a large influence on the fitted model. Refitting the model without these points would show how sensitive the estimates are to them.
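One way to follow up on the Residuals vs Leverage plot is to compute Cook's distance and refit without the flagged points. The sketch below uses simulated data with one planted influential point; the 4/n cut-off is a common rule of thumb rather than a fixed standard, and nothing here is taken from the listings data.

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.5)

# Plant one high-leverage point far from the true line (index n + 1)
x <- c(x, 30)
y <- c(y, 2 + 0.5 * 30 + 10)

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

# Flag observations whose Cook's distance exceeds 4 / n (rule of thumb)
flagged <- which(cd > 4 / length(cd))

# Refit without the flagged points and compare coefficients
fit_trimmed <- lm(y ~ x, subset = -flagged)
rbind(all_points = coef(fit), without_flagged = coef(fit_trimmed))
```

Comparing the two rows shows how much the estimates move; a large shift is a reason to investigate the flagged listings, not an automatic reason to drop them.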
model2 <- lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating+room_type,data = listings_model_1)
summary(model2) # at the 5% significance level, room_type is significant
Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = listings_model_1)
Residuals:
Min 1Q Median 3Q Max
-1.48015 -0.32051 -0.03474 0.28521 1.64810
Coefficients:
Estimate Std. Error
(Intercept) 6.8540134 0.1608274
prop_type_simplifiedEntire rental unit -0.1989411 0.0519599
prop_type_simplifiedEntire residential home 0.0080909 0.0512741
prop_type_simplifiedOther -0.4877513 0.0499914
prop_type_simplifiedPrivate room in residential home -0.7698072 0.0621849
number_of_reviews -0.0010125 0.0001002
review_scores_rating 0.0701141 0.0316186
room_typeHotel room 0.1618588 0.0824503
room_typePrivate room -0.2930125 0.0356993
room_typeShared room -1.3441640 0.0923743
t value Pr(>|t|)
(Intercept) 42.617 < 2e-16 ***
prop_type_simplifiedEntire rental unit -3.829 0.000134 ***
prop_type_simplifiedEntire residential home 0.158 0.874637
prop_type_simplifiedOther -9.757 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -12.379 < 2e-16 ***
number_of_reviews -10.104 < 2e-16 ***
review_scores_rating 2.217 0.026728 *
room_typeHotel room 1.963 0.049804 *
room_typePrivate room -8.208 4.54e-16 ***
room_typeShared room -14.551 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4583 on 1621 degrees of freedom
Multiple R-squared: 0.4999, Adjusted R-squared: 0.4971
F-statistic: 180 on 9 and 1621 DF, p-value: < 2.2e-16
autoplot(model2)+
theme_bw()
car::vif(model2)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.501720 4 1.121450
number_of_reviews 1.065645 1 1.032301
review_scores_rating 1.101116 1 1.049341
room_type 2.536817 3 1.167835
Comment: As in Model 1, residuals bounce around the 0 line, with a spread that widens as fitted values increase. The graph therefore suggests that the errors have mean zero. In the Normal Q-Q plot, a few points fall below the straight line and almost none fall noticeably above it, so, as in Model 1, the residuals are broadly consistent with the normality assumption. In the Scale-Location graph, the blue line is approximately horizontal but slopes slightly upward, indicating that the magnitude of the standardized residuals grows with the fitted values. The spread around the line also increases with the fitted values, suggesting the presence of heteroskedasticity. Finally, the Residuals vs Leverage graph shows data points that may have a large influence on the model. Checking the fit with and without those points would help assess the robustness of the model.
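Since room_type enters the model as several dummy variables, its overall significance is most cleanly assessed with a joint F-test comparing the nested models, rather than by reading the individual t-tests. A minimal sketch on simulated data (variable names mirror the report, but the numbers are made up for illustration):

```r
set.seed(7)
n <- 400
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
rating <- runif(n, 3, 5)

# Hypothetical effects of each room type on log price
shift <- c("Entire home/apt" = 0, "Private room" = -0.3, "Shared room" = -1.3)
log_price <- 6 + 0.08 * rating + unname(shift[as.character(room_type)]) +
  rnorm(n, sd = 0.45)
sim <- data.frame(log_price, rating, room_type)

small <- lm(log_price ~ rating, data = sim)              # without room_type
large <- lm(log_price ~ rating + room_type, data = sim)  # with room_type

# F-test of H0: all room_type coefficients are zero
cmp <- anova(small, large)
cmp
p_joint <- cmp[["Pr(>F)"]][2]
```

With the report's objects, the analogous call would be `anova(model1, model2)`, which is valid here because both models are fitted to the same rows of listings_model_1.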
Our dataset has many more variables, so here are some ideas on how you can extend your analysis:
Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be co-linear variables?
listings_model_2 <- listings_model_1 %>%
mutate(bathrooms_num = case_when(bathrooms_text=="1 shared bath"~1,
bathrooms_text=="3 baths"~3,
bathrooms_text=="1 private bath"~1,
bathrooms_text=="1 bath"~1,
bathrooms_text=="1.5 shared baths"~1.5,
bathrooms_text=="2.5 shared baths"~2.5,
bathrooms_text=="2 baths"~2,
bathrooms_text=="1.5 baths"~1.5,
bathrooms_text=="2.5 baths"~2.5,
bathrooms_text=="0 baths"~0,
bathrooms_text=="2 shared baths"~2,
bathrooms_text=="4 baths"~4,
bathrooms_text=="3 shared baths"~3,
bathrooms_text=="Half-bath"~0.5,
bathrooms_text=="Shared half-bath"~0.5,
bathrooms_text=="private half-bath"~0.5,
bathrooms_text=="3.5 baths"~3.5,
bathrooms_text=="3.5 shared baths"~3.5,
bathrooms_text=="5 baths"~5,
bathrooms_text=="4.5 baths"~4.5,
bathrooms_text=="4 shared baths"~4,
bathrooms_text=="5 shared baths"~5,
bathrooms_text=="5.5 baths"~5.5,
bathrooms_text=="8 shared baths"~8))
model3 <- lm(log(price_4_nights) ~ bathrooms_num + bedrooms + beds + accommodates,data = listings_model_2)
summary(model3)
Call:
lm(formula = log(price_4_nights) ~ bathrooms_num + bedrooms +
beds + accommodates, data = listings_model_2)
Residuals:
Min 1Q Median 3Q Max
-1.67221 -0.34389 -0.01925 0.34023 1.80524
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.78725 0.03241 178.577 < 2e-16 ***
bathrooms_num 0.03775 0.02703 1.397 0.163
bedrooms 0.42280 0.02940 14.379 < 2e-16 ***
beds -0.10790 0.01610 -6.702 2.83e-11 ***
accommodates 0.09773 0.01357 7.204 8.96e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.493 on 1601 degrees of freedom
(25 observations deleted due to missingness)
Multiple R-squared: 0.3919, Adjusted R-squared: 0.3904
F-statistic: 258 on 4 and 1601 DF, p-value: < 2.2e-16
car::vif(model3)
bathrooms_num bedrooms beds accommodates
1.514122 3.356921 2.829508 4.061990
Comment: According to model 3, the number of bathrooms is not a significant predictor of price (p = 0.163). This may be because the number of bathrooms is not a differentiator for a customer when deciding to book an Airbnb. However, the number of bedrooms, the number of beds, and the size of the house may influence the experience of the customer and, therefore, the final price of the Airbnb. Our model suggests that those three variables are significant. Finally, bedrooms and beds are a clear example of collinearity: most of the time, more bedrooms in a flat/apartment means more beds (correlation of 0.76). Using both variables in our model therefore adds little value.
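The long case_when used above to build bathrooms_num has to list every distinct value of bathrooms_text, and any value it misses silently becomes NA. As an alternative, here is a sketch that extracts the leading number with a regular expression; the helper `parse_bathrooms` is hypothetical, and treating every "half-bath" entry as 0.5 is an assumption that matches the recoding in the report.

```r
parse_bathrooms <- function(txt) {
  # Leading number if present, e.g. "2.5 shared baths" -> 2.5
  lead <- sub("^([0-9]+\\.?[0-9]*).*$", "\\1", txt)
  num <- suppressWarnings(as.numeric(lead))
  # Entries such as "Half-bath" carry no digit; treat them as 0.5
  num[is.na(num) & grepl("half", txt, ignore.case = TRUE)] <- 0.5
  num
}

examples <- c("1 shared bath", "2.5 baths", "Half-bath",
              "Shared half-bath", "0 baths", NA)
parsed <- parse_bathrooms(examples)
data.frame(bathrooms_text = examples, bathrooms_num = parsed)
```

Missing text stays NA, so rows without bathroom information are still dropped by lm() in the same way as before.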
Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
model4 <- lm(log(price_4_nights) ~ host_is_superhost,data = listings_model_2)
summary(model4)
Call:
lm(formula = log(price_4_nights) ~ host_is_superhost, data = listings_model_2)
Residuals:
Min 1Q Median 3Q Max
-1.64268 -0.43792 -0.03324 0.43828 1.64589
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.58432 0.02675 246.188 <2e-16 ***
host_is_superhostTRUE -0.01954 0.03338 -0.585 0.558
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6463 on 1629 degrees of freedom
Multiple R-squared: 0.0002103, Adjusted R-squared: -0.0004034
F-statistic: 0.3427 on 1 and 1629 DF, p-value: 0.5584
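Comment: Model 4 uses host_is_superhost as its only predictor, so it does not control for other variables. On its own, superhost status is not a significant predictor of price (coefficient of -0.020, p = 0.558). One possible reason is that customers do not consider this characteristic essential and focus more on other aspects of the flat/apartment. Once other variables are controlled for in model 9, however, the coefficient is positive and significant at the 5% level (0.060, p = 0.012), which points to a small superhost premium.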
Some hosts allow instant booking (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model5 <- lm(log(price_4_nights) ~ instant_bookable,data = listings_model_2)
summary(model5)
Call:
lm(formula = log(price_4_nights) ~ instant_bookable, data = listings_model_2)
Residuals:
Min 1Q Median 3Q Max
-1.71706 -0.44077 -0.03789 0.40320 1.64781
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.68687 0.02034 328.82 <2e-16 ***
instant_bookableTRUE -0.28144 0.03180 -8.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6314 on 1629 degrees of freedom
Multiple R-squared: 0.04588, Adjusted R-squared: 0.04529
F-statistic: 78.33 on 1 and 1629 DF, p-value: < 2.2e-16
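Comment: Model 5 uses instant_bookable as its only predictor, and on its own it is a significant predictor: the coefficient of -0.281 means that instantly bookable listings are about 25% cheaper. It remains significant after controlling for other variables in model 9 (coefficient of -0.104). The negative coefficient may suggest that landlords want their flats/apartments booked as soon as possible and therefore price them at a discount relative to flats that are not instantly bookable. On the other hand, customers may find it convenient to be able to book an Airbnb immediately (a feature common for hotel bookings), so a premium would also have been plausible.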
The dataset has three variables that relate to location: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighborhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighborhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights.
#unique(listings$neighbourhood_cleansed,incomparables=FALSE)
listings_model_3 <- listings_model_2 %>%
mutate(neighbourhood_simplified = ifelse(neighbourhood_cleansed %in% c(
"Financial District",
"Presidio Heights",
"Seacliff",
"Haight-Ashbury",
"Nob Hill",
"Diamond Heights",
"West of Twin Peaks",
"Russian Hill",
"Noe Valley",
"Golden Gate Park"), "Prime","Other"))
model6 <- lm(price_4_nights ~ neighbourhood_simplified, data = listings_model_3)
summary(model6)
Call:
lm(formula = price_4_nights ~ neighbourhood_simplified, data = listings_model_3)
Residuals:
Min 1Q Median 3Q Max
-799.3 -423.5 -191.7 228.3 2808.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 871.65 16.93 51.499 <2e-16 ***
neighbourhood_simplifiedPrime 79.61 43.15 1.845 0.0652 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 628.8 on 1629 degrees of freedom
Multiple R-squared: 0.002086, Adjusted R-squared: 0.001473
F-statistic: 3.405 on 1 and 1629 DF, p-value: 0.06518
Comment: There are 36 neighbourhoods in San Francisco. Using our city knowledge, we identify 10 good neighbourhoods and categorize them as “Prime”; all other neighbourhoods fall in the category “Other”. We then use this simplified neighbourhood variable in model 6 to explore the effect of location on the price for 4 nights. Unlike our other models, model 6 uses price_4_nights rather than its logarithm, so the coefficients are in price units. The intercept is 871.65 with a standard error of 16.93; it is the average 4-night price in the “Other” neighbourhoods. The neighbourhood variable has a coefficient of 79.61 with a standard error of 43.15, meaning that a listing in one of the “Prime” neighbourhoods costs on average 79.61 more for 4 nights, which matches our expectation. While the intercept is highly significant (t-value of 51.499), the coefficient is not significant at the 5% level (t-value = 1.845 < 2, p = 0.065); it is significant only at the 10% level.
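The significance of the Prime coefficient can be checked directly from the t value and residual degrees of freedom reported in the model 6 summary. A short sketch of the two-sided p-value calculation:

```r
t_value  <- 1.845   # t value reported for neighbourhood_simplifiedPrime
df_resid <- 1629    # residual degrees of freedom reported for model6

# Two-sided p-value: probability of a |t| at least this large under H0
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value

p_value < 0.05   # significant at the 5% level?
p_value < 0.10   # significant at the 10% level?
```

The result agrees, up to rounding of the t value, with the Pr(>|t|) column of the summary.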
What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
model7 <- lm(log(price_4_nights) ~ availability_30,data = listings_model_3)
summary(model7)
Call:
lm(formula = log(price_4_nights) ~ availability_30, data = listings_model_3)
Residuals:
Min 1Q Median 3Q Max
-1.66035 -0.43728 -0.05451 0.44741 1.76889
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.70058 0.02327 287.960 < 2e-16 ***
availability_30 -0.01232 0.00164 -7.514 9.43e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6355 on 1629 degrees of freedom
Multiple R-squared: 0.0335, Adjusted R-squared: 0.0329
F-statistic: 56.46 on 1 and 1629 DF, p-value: 9.429e-14
Comment: In model 7, we explore the effect of availability on the log price for 4 nights. The intercept is 6.70058 with a standard error of 0.02327. The availability variable has a coefficient of -0.01232 with a standard error of 0.00164, meaning that each additional available day in the next 30 days is associated with a 0.01232 decrease in the log price for 4 nights (about 1.2% lower). The rationale here is that high availability indicates that the property is less popular and thus has a lower price. Both the intercept and the coefficient are highly significant.
Model 7 contains no other predictors, so this is the effect of availability_30 on its own. After controlling for other variables in model 9, availability_30 remains a significant predictor, but its coefficient turns positive (0.0044).
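A coefficient estimated on its own can differ, even in sign, from the one estimated alongside other predictors (compare availability_30 in model 7 and in model 9). The simulation below sketches how this can happen when an omitted variable is related to both the predictor and the response; `z` is a purely hypothetical confounder and the numbers are not drawn from the listings data.

```r
set.seed(123)
n <- 1000
z <- rnorm(n)                                   # hypothetical confounder
x <- 2 * z + rnorm(n)                           # predictor of interest, correlated with z
y <- 1 + 0.5 * x - 2 * z + rnorm(n, sd = 0.5)   # true effect of x is +0.5

b_alone      <- coef(lm(y ~ x))[["x"]]       # z omitted
b_controlled <- coef(lm(y ~ x + z))[["x"]]   # z included

c(alone = b_alone, controlled = b_controlled)
```

This is why a single-predictor model cannot answer a question phrased as "after controlling for other variables"; the predictor has to be added to a model that already contains the controls.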
model8 <- lm(log(price_4_nights) ~ reviews_per_month,data = listings_model_3)
summary(model8)
Call:
lm(formula = log(price_4_nights) ~ reviews_per_month, data = listings_model_3)
Residuals:
Min 1Q Median 3Q Max
-1.64488 -0.43373 -0.03517 0.44219 1.61507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.604684 0.017747 372.158 < 2e-16 ***
reviews_per_month -0.009082 0.002165 -4.195 2.87e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6429 on 1629 degrees of freedom
Multiple R-squared: 0.01069, Adjusted R-squared: 0.01008
F-statistic: 17.6 on 1 and 1629 DF, p-value: 2.872e-05
Comment: In this model, we analyse the effect of reviews per month on the log price for 4 nights. The intercept is 6.605 with a standard error of 0.018. Its t value (372.158) means that the null hypothesis of a zero intercept can be rejected at any conventional level of significance. The interpretation is that when reviews_per_month is 0, the average log price for 4 nights is 6.605.
The variable reviews_per_month has an estimate of -0.009 with a standard error of 0.002. The t value is also significant and indicates that each additional review per month is associated with a change of -0.009 in the log price for 4 nights. In model 9, which includes the other variables, reviews_per_month is no longer significant (p = 0.98).
# model9 combines the variables we explored above (all except accommodates), plus number_of_reviews_l30d
model9 <- lm(log(price_4_nights) ~ availability_30 + reviews_per_month + neighbourhood_simplified +
instant_bookable + host_is_superhost + bedrooms + beds + bathrooms_num + prop_type_simplified + number_of_reviews + review_scores_rating + number_of_reviews_l30d + room_type,data = listings_model_3)
summary(model9)
Call:
lm(formula = log(price_4_nights) ~ availability_30 + reviews_per_month +
neighbourhood_simplified + instant_bookable + host_is_superhost +
bedrooms + beds + bathrooms_num + prop_type_simplified +
number_of_reviews + review_scores_rating + number_of_reviews_l30d +
room_type, data = listings_model_3)
Residuals:
Min 1Q Median 3Q Max
-1.33287 -0.26347 -0.03088 0.25644 1.44196
Coefficients:
Estimate Std. Error
(Intercept) 6.149e+00 1.514e-01
availability_30 4.444e-03 1.202e-03
reviews_per_month 3.755e-05 1.561e-03
neighbourhood_simplifiedPrime 1.282e-01 2.876e-02
instant_bookableTRUE -1.043e-01 2.220e-02
host_is_superhostTRUE 6.024e-02 2.396e-02
bedrooms 2.763e-01 2.523e-02
beds 1.407e-02 1.385e-02
bathrooms_num 6.439e-02 2.253e-02
prop_type_simplifiedEntire rental unit -1.571e-01 4.595e-02
prop_type_simplifiedEntire residential home -1.827e-01 4.651e-02
prop_type_simplifiedOther -3.257e-01 4.514e-02
prop_type_simplifiedPrivate room in residential home -6.247e-01 5.615e-02
number_of_reviews -6.055e-04 9.692e-05
review_scores_rating 7.519e-02 2.934e-02
number_of_reviews_l30d -2.051e-02 3.695e-03
room_typeHotel room 3.395e-01 7.611e-02
room_typePrivate room -2.112e-01 3.405e-02
room_typeShared room -1.500e+00 1.635e-01
t value Pr(>|t|)
(Intercept) 40.612 < 2e-16 ***
availability_30 3.697 0.000226 ***
reviews_per_month 0.024 0.980810
neighbourhood_simplifiedPrime 4.457 8.91e-06 ***
instant_bookableTRUE -4.700 2.83e-06 ***
host_is_superhostTRUE 2.514 0.012045 *
bedrooms 10.951 < 2e-16 ***
beds 1.016 0.309979
bathrooms_num 2.857 0.004327 **
prop_type_simplifiedEntire rental unit -3.419 0.000645 ***
prop_type_simplifiedEntire residential home -3.928 8.93e-05 ***
prop_type_simplifiedOther -7.216 8.27e-13 ***
prop_type_simplifiedPrivate room in residential home -11.125 < 2e-16 ***
number_of_reviews -6.247 5.36e-10 ***
review_scores_rating 2.563 0.010464 *
number_of_reviews_l30d -5.552 3.31e-08 ***
room_typeHotel room 4.460 8.76e-06 ***
room_typePrivate room -6.203 7.05e-10 ***
room_typeShared room -9.179 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4046 on 1587 degrees of freedom
(25 observations deleted due to missingness)
Multiple R-squared: 0.5941, Adjusted R-squared: 0.5895
F-statistic: 129 on 18 and 1587 DF, p-value: < 2.2e-16
autoplot(model9)+
theme_bw()
car::vif(model9)
GVIF Df GVIF^(1/(2*Df))
availability_30 1.272198 1 1.127917
reviews_per_month 1.312397 1 1.145599
neighbourhood_simplified 1.059505 1 1.029322
instant_bookable 1.172699 1 1.082913
host_is_superhost 1.302781 1 1.141395
bedrooms 3.669330 1 1.915550
beds 3.111778 1 1.764023
bathrooms_num 1.562141 1 1.249856
prop_type_simplified 3.456565 4 1.167698
number_of_reviews 1.257991 1 1.121602
review_scores_rating 1.213977 1 1.101806
number_of_reviews_l30d 1.457407 1 1.207231
room_type 4.709159 3 1.294664
Comment: After regressing the log price for 4 nights on all the variables explored above, we can make several observations. The Residuals vs Fitted plot has the characteristics of an appropriate model: the residuals float around the 0 line, suggesting the linear relationship is reasonable, and the overall band is roughly horizontal, which suggests that the variance of the error terms is roughly constant.
The Normal Q-Q plot follows the line of a normal distribution and does not show any outliers along it. In the Scale-Location plot, the blue line is horizontal across the plot, which indicates homoscedasticity: the spread of the residuals is roughly equal at all fitted values. The residuals seem to be scattered randomly around the blue line, although slightly more of them lie above the line than under it. In the Residuals vs Leverage plot, the spread of the residuals tends to decrease as leverage increases. This is expected, since high-leverage points pull the fitted values towards themselves, and it is not by itself evidence of heteroscedasticity. The residuals all seem to be close to the blue line, indicating that no individual observation has a large impact on the model.
# the model we choose:
model10<- lm(log(price_4_nights) ~ availability_30 + neighbourhood_simplified + instant_bookable + bedrooms + prop_type_simplified + number_of_reviews + review_scores_rating + room_type,data = listings_model_3)
summary(model10)
Call:
lm(formula = log(price_4_nights) ~ availability_30 + neighbourhood_simplified +
instant_bookable + bedrooms + prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = listings_model_3)
Residuals:
Min 1Q Median 3Q Max
-1.40124 -0.27617 -0.03625 0.25856 1.50340
Coefficients:
Estimate Std. Error
(Intercept) 6.200e+00 1.514e-01
availability_30 3.419e-03 1.187e-03
neighbourhood_simplifiedPrime 1.334e-01 2.846e-02
instant_bookableTRUE -1.263e-01 2.181e-02
bedrooms 3.200e-01 1.696e-02
prop_type_simplifiedEntire rental unit -1.624e-01 4.640e-02
prop_type_simplifiedEntire residential home -1.805e-01 4.689e-02
prop_type_simplifiedOther -3.405e-01 4.538e-02
prop_type_simplifiedPrivate room in residential home -6.323e-01 5.591e-02
number_of_reviews -7.539e-04 9.036e-05
review_scores_rating 8.025e-02 2.876e-02
room_typeHotel room 3.282e-01 7.500e-02
room_typePrivate room -1.993e-01 3.330e-02
room_typeShared room -1.367e+00 8.561e-02
t value Pr(>|t|)
(Intercept) 40.964 < 2e-16 ***
availability_30 2.881 0.004016 **
neighbourhood_simplifiedPrime 4.686 3.02e-06 ***
instant_bookableTRUE -5.791 8.39e-09 ***
bedrooms 18.868 < 2e-16 ***
prop_type_simplifiedEntire rental unit -3.499 0.000479 ***
prop_type_simplifiedEntire residential home -3.849 0.000123 ***
prop_type_simplifiedOther -7.504 1.02e-13 ***
prop_type_simplifiedPrivate room in residential home -11.309 < 2e-16 ***
number_of_reviews -8.343 < 2e-16 ***
review_scores_rating 2.790 0.005331 **
room_typeHotel room 4.376 1.29e-05 ***
room_typePrivate room -5.984 2.68e-09 ***
room_typeShared room -15.969 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4087 on 1617 degrees of freedom
Multiple R-squared: 0.6031, Adjusted R-squared: 0.5999
F-statistic: 189 on 13 and 1617 DF, p-value: < 2.2e-16
autoplot(model10)+
theme_bw()
car::vif(model10)
GVIF Df GVIF^(1/(2*Df))
availability_30 1.265461 1 1.124927
neighbourhood_simplified 1.029556 1 1.014670
instant_bookable 1.122926 1 1.059682
bedrooms 1.634221 1 1.278367
prop_type_simplified 3.303017 4 1.161085
number_of_reviews 1.089254 1 1.043673
review_scores_rating 1.145368 1 1.070219
room_type 3.002912 3 1.201131
Comment: Model 10 presents almost identical results for the first three plots. The exception is the Residuals vs Leverage plot, in which leverage has little impact on the residuals for almost all observations. A few points at the end of the blue line could be outliers influencing the results of the model. Overall, the diagnostic plots give no strong indication of heteroskedasticity.
Produce a summary table, using huxtable, that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
| | Model 1 | Model 2 | Model 3 | Combined Model | Final Model |
|---|---|---|---|---|---|
| (Intercept) | 6.755 | 6.854 | 5.787 | 6.149 | 6.200 |
| | (0.169) | (0.161) | (0.032) | (0.151) | (0.151) |
| prop_type_simplifiedEntire rental unit | -0.191 | -0.199 | | -0.157 | -0.162 |
| | (0.056) | (0.052) | | (0.046) | (0.046) |
| prop_type_simplifiedEntire residential home | 0.015 | 0.008 | | -0.183 | -0.180 |
| | (0.055) | (0.051) | | (0.047) | (0.047) |
| prop_type_simplifiedOther | -0.658 | -0.488 | | -0.326 | -0.341 |
| | (0.050) | (0.050) | | (0.045) | (0.045) |
| prop_type_simplifiedPrivate room in residential home | -1.049 | -0.770 | | -0.625 | -0.632 |
| | (0.055) | (0.062) | | (0.056) | (0.056) |
| number_of_reviews | -0.001 | -0.001 | | -0.001 | -0.001 |
| | (0.000) | (0.000) | | (0.000) | (0.000) |
| review_scores_rating | 0.091 | 0.070 | | 0.075 | 0.080 |
| | (0.033) | (0.032) | | (0.029) | (0.029) |
| room_typeHotel room | | 0.162 | | 0.339 | 0.328 |
| | | (0.082) | | (0.076) | (0.075) |
| room_typePrivate room | | -0.293 | | -0.211 | -0.199 |
| | | (0.036) | | (0.034) | (0.033) |
| room_typeShared room | | -1.344 | | -1.500 | -1.367 |
| | | (0.092) | | (0.163) | (0.086) |
| bathrooms_num | | | 0.038 | 0.064 | |
| | | | (0.027) | (0.023) | |
| bedrooms | | | 0.423 | 0.276 | 0.320 |
| | | | (0.029) | (0.025) | (0.017) |
| beds | | | -0.108 | 0.014 | |
| | | | (0.016) | (0.014) | |
| accommodates | | | 0.098 | | |
| | | | (0.014) | | |
| availability_30 | | | | 0.004 | 0.003 |
| | | | | (0.001) | (0.001) |
| reviews_per_month | | | | 0.000 | |
| | | | | (0.002) | |
| neighbourhood_simplifiedPrime | | | | 0.128 | 0.133 |
| | | | | (0.029) | (0.028) |
| instant_bookableTRUE | | | | -0.104 | -0.126 |
| | | | | (0.022) | (0.022) |
| host_is_superhostTRUE | | | | 0.060 | |
| | | | | (0.024) | |
| number_of_reviews_l30d | | | | -0.021 | |
| | | | | (0.004) | |
| Number of observations | 1631 | 1631 | 1606 | 1606 | 1631 |
| Adj. R Squared | 0.416 | 0.497 | 0.390 | 0.590 | 0.600 |
| Residual SE | 0.494 | 0.458 | 0.493 | 0.405 | 0.409 |
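The summary statistics in the last rows of the table can also be collected without any extra package. The sketch below does so in base R for two small models fitted on the built-in mtcars data; these stand-in models and their labels are assumptions for illustration, and in the report the list would hold model1, model2, model3, model9, and model10.

```r
# Stand-in models on a built-in dataset (not the Airbnb models)
fit_a <- lm(mpg ~ wt, data = mtcars)
fit_b <- lm(mpg ~ wt + hp, data = mtcars)
models <- list("Model A" = fit_a, "Model B" = fit_b)

# One row per model: observations used, adjusted R^2, residual standard error
comparison <- do.call(rbind, lapply(names(models), function(nm) {
  s <- summary(models[[nm]])
  data.frame(model = nm,
             n_obs = length(residuals(models[[nm]])),
             adj_r_squared = round(s$adj.r.squared, 3),
             residual_se = round(s$sigma, 3))
}))
comparison
```

Reporting the number of observations alongside the fit statistics matters here, because Model 3 and the Combined Model are fitted on 1606 rows while the others use 1631, so their statistics are not computed on exactly the same data.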
Finally, we use the final model to predict price_4_nights.
data_for_predict <- listings_model_3 %>%
filter(room_type == "Private room",
number_of_reviews_l30d>=10,
review_scores_rating>=0.9)
#data_for_predict <- data.frame(availability_30=30, neighbourhood_simplified ="Prime" , instant_bookable=TRUE, bedrooms=1 , prop_type_simplified="Private room in residential home" , number_of_reviews=10 , review_scores_rating=4 , room_type="Private room")
data_for_predict
# A tibble: 15 x 26
price_4_nights accommodates bedrooms beds bathrooms_text minimum_nights
<dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 392 2 1 1 1 private bath 1
2 260 2 1 1 1 private bath 1
3 280 4 1 1 1 private bath 1
4 300 4 1 2 1 private bath 1
5 520 2 1 1 1 private bath 1
6 248 3 1 1 1 shared bath 1
7 228 2 1 1 1 shared bath 1
8 364 2 1 1 1 private bath 1
9 376 2 1 1 1 shared bath 2
10 344 2 1 1 1 private bath 1
11 372 2 1 1 1 shared bath 2
12 612 2 1 1 1 private bath 1
13 328 2 1 1 1 shared bath 1
14 552 2 1 1 1 private bath 1
15 336 2 1 1 5 shared baths 2
# ... with 20 more variables: maximum_nights <dbl>, availability_30 <dbl>,
# number_of_reviews <dbl>, number_of_reviews_l30d <dbl>,
# reviews_per_month <dbl>, review_scores_rating <dbl>,
# review_scores_value <dbl>, host_response_rate <dbl>,
# host_acceptance_rate <dbl>, host_is_superhost <lgl>,
# host_listings_count <dbl>, host_total_listings_count <dbl>,
# host_identity_verified <lgl>, instant_bookable <lgl>, ...
# When we plug this multi-row data frame into predict(), it'll generate a
# prediction for each row
model_prediction <- data.frame(predict(model10, newdata = data_for_predict, interval = "confidence")) %>%
mutate(Price = exp(fit),
CI_lower = exp(lwr),
CI_upper = exp(upr)) %>%
select(4,5,6)
model_prediction %>%
kbl() %>%
kable_classic(full_width = F, html_font = "Cambria") %>%
kable_styling(bootstrap_options = c("striped", "condensed"))
| Price | CI_lower | CI_upper |
|---|---|---|
| 321.6923 | 287.8543 | 359.5080 |
| 295.7143 | 269.9815 | 323.8997 |
| 337.0595 | 318.3466 | 356.8724 |
| 332.2504 | 312.6587 | 353.0697 |
| 359.5702 | 338.2829 | 382.1972 |
| 357.3532 | 338.6215 | 377.1211 |
| 401.1296 | 380.4416 | 422.9426 |
| 447.1335 | 423.1091 | 472.5222 |
| 367.4627 | 345.5347 | 390.7823 |
| 444.6029 | 408.1545 | 484.3061 |
| 525.9696 | 498.0300 | 555.4766 |
| 529.3585 | 500.1927 | 560.2250 |
| 462.7024 | 435.9064 | 491.1456 |
| 517.2307 | 488.5524 | 547.5924 |
| 549.0524 | 518.0488 | 581.9115 |
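Comment: The table above gives the predicted 4-night price and its 95% confidence interval for each of the 15 listings that pass the filter (private rooms with at least 10 reviews in the last 30 days and a rating of at least 0.9). For a single hypothetical listing, defined in the commented-out data frame above (a private room in a 1 bedroom residential home in a Prime neighbourhood, instantly bookable, with 30 available days, 10 reviews, and a rating of 4), the model predicts a price of 451.07 with a confidence interval of 410.36 - 495.81.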
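The point estimate of about 451 can be reproduced by hand from the model10 coefficients, assuming the inputs are those of the commented-out data frame (availability_30 = 30, Prime neighbourhood, instant_bookable = TRUE, bedrooms = 1, private room in residential home, number_of_reviews = 10, review_scores_rating = 4, room_type = Private room). The coefficients below are copied from the summary output at its printed precision, so the result matches only up to rounding.

```r
# model10 coefficients as printed in the summary (rounded)
coefs <- c(intercept            =  6.200,
           availability_30      =  0.003419,
           prime                =  0.1334,
           instant_bookable     = -0.1263,
           bedrooms             =  0.3200,
           private_room_in_home = -0.6323,
           number_of_reviews    = -0.0007539,
           review_scores_rating =  0.08025,
           room_private         = -0.1993)

# Inputs for the hypothetical listing (1 = dummy switched on)
values <- c(1, 30, 1, 1, 1, 1, 10, 4, 1)

log_price_hat <- sum(coefs * values)   # prediction on the log scale
price_hat <- exp(log_price_hat)        # back-transform to price units
price_hat
```

The interval reported with interval = "confidence" describes uncertainty about the average price of such listings; predict() with interval = "prediction" would give the wider interval for an individual listing.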
Things that can be improved in the future: